Differences between PCA and clustering

Because principal component analysis and clustering are both directed at finding patterns in a dataset, often with the goal of providing a means to classify, they may appear interchangeable, but they have important differences. The principles of tidy data, in which each column in a dataframe represents a variable and each row represents an observation, help illustrate the difference. PCA is intended to reduce the variables to a small number of components that still account for much of the variation in the data. We can use the results to make inferences about how the observations are grouped, but those inferences are not part of PCA itself.
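As a sketch of this distinction, base R's prcomp() runs PCA on the mtcars data used later in this exercise; it returns components ranked by the variation they explain, but grouping the observations is left to us:

```r
# PCA with base R: prcomp() reduces the 11 mtcars variables to
# principal components ranked by the variation they explain
pca <- prcomp(mtcars, scale. = TRUE)

# Cumulative proportion of variance explained by each component;
# a few components account for most of the variation
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 2)
```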

In contrast, clustering paradigms seek to group observations such that closely related observations fall within the same group while the separation between groups is large. Both “closely related” and “large separation” are subjective criteria, so the method used to cluster data carries certain a priori assumptions. Furthermore, the number of subgroups ultimately generated can vary depending on how the analysis is applied.
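One way to see that “closely related” depends on choices made up front is to compare two distance metrics on the same scaled data; a minimal base-R sketch:

```r
x <- scale(mtcars)

# Pairwise distances between observations under two different metrics
d_euc <- dist(x, method = "euclidean")
d_man <- dist(x, method = "manhattan")

# The metrics are correlated but not identical, so which cars count as
# "closely related" can change with the metric chosen
cor(as.vector(d_euc), as.vector(d_man))
```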

Hierarchical clustering

Hierarchical clustering proceeds in a sequence of steps, leading to a tree-like network (dendrogram) in which all observations are classified. This can be done in one of two ways: agglomerative (bottom-up) clustering or divisive (top-down) clustering. In agglomerative clustering, each observation starts in its own group, and groups are then merged based on similarity. Based on the measure of distance used to determine similarity or difference, the algorithm can then create second-order groups (groups of groups), then third-order groups, and so on until all observations fit into the dendrogram. This approach allows clustering to be viewed at different hierarchical levels. Divisive clustering starts with all observations in a single group and then divides them into smaller and smaller groups.

An example of a hierarchical cluster is species taxonomy, in which all species are part of a single tree of life, but the number of clusters depends on what level we examine: genus, family, order, class, phylum, etc.
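The taxonomy analogy can be sketched with base R's hclust() and cutree(): the same tree yields different numbers of clusters depending on where it is cut.

```r
x  <- scale(mtcars)
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D2")

# Cutting the same dendrogram at different levels gives different groupings,
# much as taxonomy can be read at the genus level or the phylum level
table(cutree(hc, k = 3))   # three broad groups
table(cutree(hc, k = 10))  # ten finer groups
```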

Partitional clustering

k-means clustering, a form of partitional clustering, divides the observations into a predetermined number of clusters (k) such that the variation within each cluster is minimized and the variation between clusters is maximized. No set rule determines the ideal number of clusters, although certain guidelines can be used to optimize it. Unlike hierarchical clustering, partitional clustering creates the clusters in a single step based on the distances between observations. The result resembles the output from PCA, but the algorithm evaluates observations, not variables.

k-means clustering uses an iterative process to assign k centroids to the observations, group the data so that each observation is assigned to the closest centroid, then optimize the locations of the centroids to minimize the within-cluster distances. As a result, it can be sensitive to outlier data points. It also tends to create clusters of comparable size. Modified versions of k-means clustering exist that use different approaches to determining distance between observations. For this exercise, however, we will examine how k-means clustering works and can be optimized.
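These properties can be seen directly with base R's kmeans(), whose nstart argument plays the same role as in eclust() below; a sketch with an assumed k of 3:

```r
set.seed(1)      # for a reproducible sketch
x <- scale(mtcars)

# nstart = 25 tries 25 random initial centroid configurations
# and keeps the best result
km <- kmeans(x, centers = 3, nstart = 25)

km$centers   # final centroid locations (in scaled units)
km$size      # cluster sizes; k-means tends toward comparable sizes
km$iter      # iterations until the centroid locations stabilized
```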

An example of partitional clustering

As an example, we can use the mtcars dataset that is included with base R. The table contains 1974 technical data on 32 different car models from around the world and is a small but useful dataset to illustrate principles of k-means clustering.

glimpse(mtcars)                 # glimpse() is from the dplyr package
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

As in PCA, we will scale the data using the scale() function in base R.

mtcars_s <- scale(mtcars)
summary(mtcars_s)
##       mpg               cyl              disp               hp         
##  Min.   :-1.6079   Min.   :-1.225   Min.   :-1.2879   Min.   :-1.3810  
##  1st Qu.:-0.7741   1st Qu.:-1.225   1st Qu.:-0.8867   1st Qu.:-0.7320  
##  Median :-0.1478   Median :-0.105   Median :-0.2777   Median :-0.3455  
##  Mean   : 0.0000   Mean   : 0.000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4495   3rd Qu.: 1.015   3rd Qu.: 0.7688   3rd Qu.: 0.4859  
##  Max.   : 2.2913   Max.   : 1.015   Max.   : 1.9468   Max.   : 2.7466  
##       drat               wt               qsec                vs        
##  Min.   :-1.5646   Min.   :-1.7418   Min.   :-1.87401   Min.   :-0.868  
##  1st Qu.:-0.9661   1st Qu.:-0.6500   1st Qu.:-0.53513   1st Qu.:-0.868  
##  Median : 0.1841   Median : 0.1101   Median :-0.07765   Median :-0.868  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000  
##  3rd Qu.: 0.6049   3rd Qu.: 0.4014   3rd Qu.: 0.58830   3rd Qu.: 1.116  
##  Max.   : 2.4939   Max.   : 2.2553   Max.   : 2.82675   Max.   : 1.116  
##        am               gear              carb        
##  Min.   :-0.8141   Min.   :-0.9318   Min.   :-1.1222  
##  1st Qu.:-0.8141   1st Qu.:-0.9318   1st Qu.:-0.5030  
##  Median :-0.8141   Median : 0.4236   Median :-0.5030  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.1899   3rd Qu.: 0.4236   3rd Qu.: 0.7352  
##  Max.   : 1.1899   Max.   : 1.7789   Max.   : 3.2117

Once the data are scaled, we want to determine the distance between observations. There are many ways to measure distance between observations in a data matrix. For the purposes of this exercise, we will use the Euclidean method via get_dist(), a function in the factoextra package.

res_dist <- get_dist(mtcars_s, method = "euclidean")
fviz_dist(res_dist, lab_size = 8)

The eclust() function, also in the factoextra package, incorporates ggplot2 for visualization. It allows several different clustering paradigms; we will use k-means. The nstart argument sets the number of random starting configurations for the centroids; the algorithm keeps the best result.

res_kmeans <- eclust(mtcars_s, "kmeans", k.max = 15, nstart = 25)

The function returned 14 clusters. Is this the ideal number? Some subjective decisions must be made in the analysis. To get an idea of how different numbers of clusters lead to different interpretations of the data, we can assign different values to k and see how the groupings change. This can be accomplished with a for loop that applies values of k from 2 to 15.

for (i in 2:15) {eclust(mtcars_s, "kmeans", k = i, nstart = 25)}

The fviz_gap_stat() function can offer an idea of the optimal number of clusters to use. A similar value is stored in the nbclust element of the object generated by eclust().

fviz_gap_stat(res_kmeans$gap_stat)

res_kmeans$nbclust
## [1] 14
fviz_silhouette(res_kmeans)
##    cluster size ave.sil.width
## 1        1    1          0.00
## 2        2    3          0.46
## 3        3    2          0.47
## 4        4    2          0.03
## 5        5    2          0.64
## 6        6    3          0.68
## 7        7    2          0.85
## 8        8    1          0.00
## 9        9    4          0.20
## 10      10    1          0.00
## 11      11    3          0.22
## 12      12    2          0.48
## 13      13    3          0.63
## 14      14    3          0.42
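The gap statistic and silhouette width are not the only heuristics for choosing k. Another common one is the “elbow” method, which tracks the total within-cluster sum of squares as k grows; a base-R sketch of the idea (factoextra's fviz_nbclust() wraps the same calculation):

```r
# Total within-cluster sum of squares for k = 2..15; look for the "elbow"
# where adding more clusters stops paying off
set.seed(1)
wss <- sapply(2:15, function(k) {
  kmeans(scale(mtcars), centers = k, nstart = 25)$tot.withinss
})
plot(2:15, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```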

An example of hierarchical clustering

res_hc <- mtcars %>%
  scale() %>%                                              # Scale the data
  dist(method = "euclidean") %>%                           # Compute dissimilarity matrix
  hclust(method = "ward.D2")                               # Compute hierarchical clustering
fviz_dend(res_hc, k = 10,                                  # Use 10 groups
          cex = 0.5,                                       # label size
          color_labels_by_k = TRUE,                        # color labels by groups
          rect = TRUE                                      # Add rectangles around groups
          )
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
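fviz_dend() only draws the groups; to work with the assignments themselves, base R's cutree() cuts the tree into the same 10 groups (the pipeline is rebuilt here so the sketch is self-contained):

```r
# Same pipeline as above, in base R
res_hc2 <- hclust(dist(scale(mtcars), method = "euclidean"),
                  method = "ward.D2")

grp <- cutree(res_hc2, k = 10)  # cluster assignment for each car
table(grp)                      # number of cars in each of the 10 groups
```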

mtcars analysis repeated using only columns mpg:qsec

mtcars_s_lim <- mtcars %>% select(mpg:qsec) %>% scale()
res_dist_lim <- get_dist(mtcars_s_lim, method = "euclidean")
fviz_dist(res_dist_lim, lab_size = 8)

res_lim_kmeans <- eclust(mtcars_s_lim, "kmeans", k.max = 15, nstart = 25)

fviz_gap_stat(res_lim_kmeans$gap_stat)

for (i in 2:15) {eclust(mtcars_s_lim, "kmeans", k = i, nstart = 25)}

Work in progress below

The tidymodels package

library(tidymodels)

set.seed(27)                    # ensure a reproducible example

centers <- tibble(
  cluster = factor(1:3), 
  num_points = c(100, 150, 50),  # number of points in each cluster
  x1 = c(5, 0, -3),              # x1 coordinate of cluster center
  x2 = c(-1, 1, -2)              # x2 coordinate of cluster center
)

labelled_points <- 
  centers %>%
  mutate(
    x = map2(num_points, x1, rnorm),
    y = map2(num_points, x2, rnorm)
  ) %>% 
  select(-num_points, -x1, -x2) %>% 
  unnest(cols = c(x, y))

ggplot(labelled_points, aes(x, y, color = cluster)) +
  geom_point(alpha = 0.3)

points <- 
  labelled_points %>% 
  select(-cluster)

kclust <- kmeans(points, centers = 3)
kclust
## K-means clustering with 3 clusters of sizes 148, 51, 101
## 
## Cluster means:
##             x         y
## 1  0.08853475  1.045461
## 2 -3.14292460 -2.000043
## 3  5.00401249 -1.045811
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
## [260] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [297] 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 298.9415 108.8112 243.2092
##  (between_SS / total_SS =  82.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
summary(kclust)
##              Length Class  Mode   
## cluster      300    -none- numeric
## centers        6    -none- numeric
## totss          1    -none- numeric
## withinss       3    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           3    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric
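The base summary() output above is hard to work with; broom (loaded as part of tidymodels) supplies tidy accessors that reshape a kmeans object into data frames. A sketch using a stand-in for the points tibble above:

```r
library(broom)   # loaded automatically with tidymodels

set.seed(27)
pts <- data.frame(x = rnorm(300), y = rnorm(300))  # stand-in for `points`
kc  <- kmeans(pts, centers = 3, nstart = 25)

tidy(kc)          # one row per cluster: center coordinates, size, withinss
glance(kc)        # one-row model summary: totss, tot.withinss, betweenss, iter
augment(kc, pts)  # the original points with a .cluster column appended
```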